A stream-of-consciousness analysis and exploration of the data.
Load the data set into R
Sample 10,000 observations and look through their data types and summary.
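A minimal sketch of these two steps. The source does not name the data file, so a small generated data frame stands in for it; the seed and the sample size of 5 are arbitrary choices for illustration. `loan_sample` is the name the later code uses for the sample.

```r
# Stand-in for the full loan data. With the real file this would be a
# read.csv() call on the loan CSV (its path is not given in the source).
set.seed(42)  # arbitrary seed so the sample is reproducible
loans <- data.frame(
  ListingNumber = 1:20,
  Term          = sample(c(12, 36, 60), 20, replace = TRUE),
  BorrowerAPR   = round(runif(20, 0.05, 0.40), 3)
)

# Draw rows at random without replacement (10,000 in the report, 5 here).
loan_sample <- loans[sample(nrow(loans), 5), ]

str(loan_sample)      # data type of each column
summary(loan_sample)  # per-column summary statistics
```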
## 'data.frame': 10000 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 107874 1579 91587 23849 48589 27645 20249 97958 19565 21896 ...
## $ ListingNumber : int 601187 1024213 761118 1083410 22540 151187 305309 540132 64324 1033882 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 57080 97897 73322 100624 1746 11913 21216 46427 4956 99085 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 1 1 1 1 7 5 3 1 6 1 ...
## $ Term : int 36 36 60 36 36 36 36 36 36 60 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 4 4 4 3 5 3 4 2 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1 1 1 1 427 1303 1184 1 604 1 ...
## $ BorrowerAPR : num 0.227 0.261 0.223 0.167 0.258 ...
## $ BorrowerRate : num 0.19 0.224 0.198 0.131 0.25 ...
## $ LenderYield : num 0.18 0.213 0.188 0.121 0.245 ...
## $ EstimatedEffectiveYield : num 0.176 0.196 0.176 0.116 NA ...
## $ EstimatedLoss : num 0.065 0.1075 0.0674 0.0449 NA ...
## $ EstimatedReturn : num 0.111 0.0882 0.1091 0.0708 NA ...
## $ ProsperRating..numeric. : int 4 3 4 5 NA NA NA 3 NA 4 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 5 6 5 4 1 1 1 6 1 5 ...
## $ ProsperScore : num 5 3 5 6 NA NA NA 7 NA 5 ...
## $ ListingCategory..numeric. : int 1 1 1 3 0 0 3 7 0 1 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 28 12 33 18 1 1 17 11 12 6 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 37 37 32 61 37 52 8 2 52 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 2 2 2 2 4 8 3 2 4 2 ...
## $ EmploymentStatusDuration : int 61 199 36 302 NA 31 13 161 NA 98 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 1 1 2 2 1 1 2 1 2 1 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 1 1 1 1 2 2 1 1 2 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 1 1 159 161 1 1 568 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 57050 97955 73265 100713 1736 11944 21308 46357 5132 99121 ...
## $ CreditScoreRangeLower : int 680 720 720 720 560 660 800 660 600 660 ...
## $ CreditScoreRangeUpper : int 699 739 739 739 579 679 819 679 619 679 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 9713 5790 2227 7881 3449 4455 6174 9391 6772 7521 ...
## $ CurrentCreditLines : int 16 10 11 9 NA 6 8 6 NA 12 ...
## $ OpenCreditLines : int 16 10 10 8 NA 4 5 6 NA 12 ...
## $ TotalCreditLinespast7years : int 23 22 20 18 30 22 21 12 41 25 ...
## $ OpenRevolvingAccounts : int 5 6 7 6 2 2 2 4 6 8 ...
## $ OpenRevolvingMonthlyPayment : num 345 472 647 533 45 84 5 98 56 477 ...
## $ InquiriesLast6Months : int 0 2 0 1 3 7 1 0 6 2 ...
## $ TotalInquiries : num 2 3 3 1 12 14 2 4 19 4 ...
## $ CurrentDelinquencies : int 0 0 0 0 7 1 0 2 5 0 ...
## $ AmountDelinquent : num 0 0 0 0 NA 47 0 179 NA 0 ...
## $ DelinquenciesLast7Years : int 0 0 0 0 28 1 3 4 49 0 ...
## $ PublicRecordsLast10Years : int 0 0 0 0 2 0 0 0 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 0 0 NA 0 0 0 NA 0 ...
## $ RevolvingCreditBalance : num 7622 16308 23741 17427 NA ...
## $ BankcardUtilization : num 0.8 0.62 0.9 0.75 NA 0.84 0 0.09 NA 0.9 ...
## $ AvailableBankcardCredit : num 1691 9892 2345 3810 NA ...
## $ TotalTrades : num 19 19 20 18 NA 19 18 11 NA 20 ...
## $ TradesNeverDelinquent..percentage. : num 1 1 1 1 NA 0.8 0.94 0.66 NA 0.9 ...
## $ TradesOpenedLast6Months : num 0 0 1 0 NA 1 2 0 NA 1 ...
## $ DebtToIncomeRatio : num 0.43 0.54 0.25 0.31 0.26 0.27 0.1 0.22 0.13 0.18 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 6 3 5 7 5 5 4 7 3 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3333 6667 10406 4583 3167 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 54620 7455 53342 95577 84689 31396 20225 104350 111631 30081 ...
## $ TotalProsperLoans : int 1 NA NA NA NA NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int 21 NA NA NA NA NA NA NA NA NA ...
## $ OnTimeProsperPayments : int 21 NA NA NA NA NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int 0 NA NA NA NA NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int 0 NA NA NA NA NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num 1000 NA NA NA NA NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num 0 NA NA NA NA NA NA NA NA NA ...
## $ ScorexChangeAtTimeOfListing : int 124 NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 294 0 0 2286 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA 31 NA NA 15 NA ...
## $ LoanMonthsSinceOrigination : int 21 3 10 3 92 81 71 28 87 3 ...
## $ LoanNumber : int 68892 119639 89737 123208 1788 16437 29853 56394 5215 120923 ...
## $ LoanOriginalAmount : int 9000 10000 25000 10400 2550 13000 10000 2000 3500 25000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1449 1807 1660 1819 130 373 577 1299 233 1812 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 15 33 16 33 17 10 11 31 26 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 42919 85223 62619 71297 7382 46727 55820 80117 13351 5608 ...
## $ MonthlyLoanPayment : num 330 384 660 351 101 ...
## $ LP_CustomerPayments : num 6600 1151 6601 702 3226 ...
## $ LP_CustomerPrincipalPayments : num 4366 610 2658 473 2557 ...
## $ LP_InterestandFees : num 2234 541 3943 230 670 ...
## $ LP_ServiceFees : num -117.5 -24.2 -198.7 -17.5 -13.1 ...
## $ LP_CollectionFees : num 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 1 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 1 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 400 0 0 0 0 ...
## $ Investors : int 164 16 2 153 18 328 203 13 48 134 ...
## ListingKey ListingNumber
## 2E4135942338729616CBB84: 2 Min. : 34
## 32943590099161153292459: 2 1st Qu.: 402053
## 43A53602342416699E07666: 2 Median : 596793
## D30235991099246051BBA60: 2 Mean : 624707
## 00033425227988088FA6752: 1 3rd Qu.: 883750
## 0006357747559732389619D: 1 Max. :1250010
## (Other) :9990
## ListingCreationDate CreditGrade Term
## 2013-09-21 11:35:49.803000000: 2 :7464 Min. :12.00
## 2013-11-01 16:35:05.490000000: 2 C : 513 1st Qu.:36.00
## 2014-01-05 13:42:23.993000000: 2 D : 403 Median :36.00
## 2014-02-02 21:53:24.147000000: 2 B : 387 Mean :40.81
## 2005-11-28 16:16:35.077000000: 1 AA : 330 3rd Qu.:36.00
## 2005-11-29 13:29:16.810000000: 1 HR : 319 Max. :60.00
## (Other) :9990 (Other): 584
## LoanStatus ClosedDate BorrowerAPR
## Current :4869 :5067 Min. :0.02659
## Completed :3420 2014-01-14 00:00:00: 15 1st Qu.:0.15549
## Chargedoff :1112 2013-03-26 00:00:00: 12 Median :0.21025
## Defaulted : 399 2014-03-04 00:00:00: 12 Mean :0.21894
## Past Due (1-15 days) : 79 2012-11-06 00:00:00: 11 3rd Qu.:0.28386
## Past Due (31-60 days): 31 2013-06-26 00:00:00: 11 Max. :0.45857
## (Other) : 90 (Other) :4872 NA's :3
## BorrowerRate LenderYield EstimatedEffectiveYield
## Min. :0.0100 Min. :0.0000 Min. :-0.0795
## 1st Qu.:0.1334 1st Qu.:0.1234 1st Qu.: 0.1157
## Median :0.1840 Median :0.1740 Median : 0.1616
## Mean :0.1928 Mean :0.1828 Mean : 0.1692
## 3rd Qu.:0.2500 3rd Qu.:0.2400 3rd Qu.: 0.2254
## Max. :0.4500 Max. :0.4325 Max. : 0.3199
## NA's :2548
## EstimatedLoss EstimatedReturn ProsperRating..numeric.
## Min. :0.0049 Min. :-0.0795 Min. :1.000
## 1st Qu.:0.0424 1st Qu.: 0.0741 1st Qu.:3.000
## Median :0.0712 Median : 0.0927 Median :4.000
## Mean :0.0800 Mean : 0.0966 Mean :4.077
## 3rd Qu.:0.1120 3rd Qu.: 0.1174 3rd Qu.:5.000
## Max. :0.3660 Max. : 0.2837 Max. :7.000
## NA's :2548 NA's :2548 NA's :2548
## ProsperRating..Alpha. ProsperScore ListingCategory..numeric.
## :2548 Min. : 1.000 Min. : 0.000
## C :1619 1st Qu.: 4.000 1st Qu.: 1.000
## B :1338 Median : 6.000 Median : 1.000
## A :1302 Mean : 6.003 Mean : 2.761
## D :1249 3rd Qu.: 8.000 3rd Qu.: 3.000
## E : 860 Max. :11.000 Max. :20.000
## (Other):1084 NA's :2548
## BorrowerState Occupation EmploymentStatus
## CA :1284 Other :2433 Employed :5883
## TX : 644 Professional :1187 Full-time :2303
## FL : 619 Computer Programmer : 364 Self-employed: 539
## NY : 532 Executive : 356 Not available: 466
## IL : 525 : 325 Other : 328
## : 497 Administrative Assistant: 317 : 210
## (Other):5899 (Other) :5018 (Other) : 271
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 False:4916 False:8922
## 1st Qu.: 26.00 True :5084 True :1078
## Median : 67.00
## Mean : 96.09
## 3rd Qu.:136.00
## Max. :648.00
## NA's :679
## GroupKey DateCreditPulled
## :8858 2013-11-01 16:35:12 : 2
## 783C3371218786870A73D20: 97 2013-11-02 08:55:31 : 2
## 3D4D3366260257624AB272D: 76 2014-02-02 21:53:27 : 2
## 6A3B336601725506917317E: 59 2014-02-18 11:22:58 : 2
## FEF83377364176536637E50: 52 2005-11-28 10:19:33.010000000: 1
## CD0E3364909037313F32874: 34 2005-11-28 16:16:35.077000000: 1
## (Other) : 824 (Other) :9990
## CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.0 Min. : 19.0
## 1st Qu.:660.0 1st Qu.:679.0
## Median :680.0 Median :699.0
## Mean :685.6 Mean :704.6
## 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :880.0 Max. :899.0
## NA's :60 NA's :60
## FirstRecordedCreditLine CurrentCreditLines OpenCreditLines
## : 66 Min. : 0.00 Min. : 0.000
## 1994-11-01 00:00:00: 22 1st Qu.: 6.75 1st Qu.: 6.000
## 1996-10-01 00:00:00: 22 Median :10.00 Median : 9.000
## 1990-04-01 00:00:00: 20 Mean :10.35 Mean : 9.296
## 1990-05-01 00:00:00: 19 3rd Qu.:13.00 3rd Qu.:12.000
## 1996-11-01 00:00:00: 19 Max. :52.00 Max. :51.000
## (Other) :9832 NA's :676 NA's :676
## TotalCreditLinespast7years OpenRevolvingAccounts
## Min. : 2.00 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.: 4.000
## Median : 25.00 Median : 6.000
## Mean : 26.97 Mean : 7.012
## 3rd Qu.: 35.00 3rd Qu.: 9.000
## Max. :127.00 Max. :51.000
## NA's :66
## OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries
## Min. : 0.0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 113.0 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 268.0 Median : 1.000 Median : 4.000
## Mean : 396.8 Mean : 1.397 Mean : 5.538
## 3rd Qu.: 527.0 3rd Qu.: 2.000 3rd Qu.: 7.000
## Max. :10977.0 Max. :52.000 Max. :106.000
## NA's :66 NA's :113
## CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.0 1st Qu.: 0.000
## Median : 0.0000 Median : 0.0 Median : 0.000
## Mean : 0.5791 Mean : 934.1 Mean : 4.162
## 3rd Qu.: 0.0000 3rd Qu.: 0.0 3rd Qu.: 3.000
## Max. :31.0000 Max. :255963.0 Max. :99.000
## NA's :66 NA's :677 NA's :93
## PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance
## Min. : 0.0000 Min. :0.0000 Min. : 0
## 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.: 3085
## Median : 0.0000 Median :0.0000 Median : 8338
## Mean : 0.3179 Mean :0.0147 Mean : 17879
## 3rd Qu.: 0.0000 3rd Qu.:0.0000 3rd Qu.: 19457
## Max. :15.0000 Max. :4.0000 Max. :1433328
## NA's :66 NA's :676 NA's :676
## BankcardUtilization AvailableBankcardCredit TotalTrades
## Min. :0.0000 Min. : 0 Min. : 1.00
## 1st Qu.:0.3100 1st Qu.: 875 1st Qu.: 15.00
## Median :0.6100 Median : 4139 Median : 22.00
## Mean :0.5644 Mean : 11133 Mean : 23.36
## 3rd Qu.:0.8400 3rd Qu.: 13017 3rd Qu.: 30.00
## Max. :2.3600 Max. :292662 Max. :114.00
## NA's :676 NA's :672 NA's :672
## TradesNeverDelinquent..percentage. TradesOpenedLast6Months
## Min. :0.0000 Min. : 0.0000
## 1st Qu.:0.8200 1st Qu.: 0.0000
## Median :0.9400 Median : 0.0000
## Mean :0.8864 Mean : 0.8014
## 3rd Qu.:1.0000 3rd Qu.: 1.0000
## Max. :1.0000 Max. :17.0000
## NA's :672 NA's :672
## DebtToIncomeRatio IncomeRange IncomeVerifiable
## Min. : 0.0000 $50,000-74,999:2776 False: 787
## 1st Qu.: 0.1400 $25,000-49,999:2724 True :9213
## Median : 0.2200 $100,000+ :1575
## Mean : 0.2662 $75,000-99,999:1422
## 3rd Qu.: 0.3100 Not displayed : 688
## Max. :10.0100 $1-24,999 : 664
## NA's :772 (Other) : 151
## StatedMonthlyIncome LoanKey TotalProsperLoans
## Min. : 0 09303699897852595CD59DD: 2 Min. :1.000
## 1st Qu.: 3250 5E9337054508165362CD556: 2 1st Qu.:1.000
## Median : 4667 64A8370161790336267B379: 2 Median :1.000
## Mean : 5611 F86637075491079348E0575: 2 Mean :1.459
## 3rd Qu.: 6833 000537001363220451EA011: 1 3rd Qu.:2.000
## Max. :140417 00093662314540397D8EFEA: 1 Max. :7.000
## (Other) :9990 NA's :8062
## TotalProsperPaymentsBilled OnTimeProsperPayments
## Min. : 0.00 Min. : 0.0
## 1st Qu.: 9.00 1st Qu.: 9.0
## Median : 16.00 Median : 16.0
## Mean : 23.38 Mean : 22.7
## 3rd Qu.: 34.00 3rd Qu.: 32.0
## Max. :141.00 Max. :141.0
## NA's :8062 NA's :8062
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## Min. : 0.000 Min. :0.000
## 1st Qu.: 0.000 1st Qu.:0.000
## Median : 0.000 Median :0.000
## Mean : 0.628 Mean :0.056
## 3rd Qu.: 0.000 3rd Qu.:0.000
## Max. :33.000 Max. :8.000
## NA's :8062 NA's :8062
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## Min. : 1000 Min. : 0
## 1st Qu.: 3500 1st Qu.: 0
## Median : 6100 Median : 1741
## Mean : 8671 Mean : 2932
## 3rd Qu.:11364 3rd Qu.: 4234
## Max. :56494 Max. :23034
## NA's :8062 NA's :8062
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## Min. :-167.000 Min. : 0.0
## 1st Qu.: -35.000 1st Qu.: 0.0
## Median : -2.000 Median : 0.0
## Mean : -5.021 Mean : 156.5
## 3rd Qu.: 20.000 3rd Qu.: 0.0
## Max. : 214.000 Max. :2703.0
## NA's :8335
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## Min. : 0.00 Min. : 0.00 Min. : 16
## 1st Qu.:10.00 1st Qu.: 6.00 1st Qu.: 37510
## Median :15.00 Median :21.00 Median : 67989
## Mean :16.47 Mean :32.04 Mean : 69024
## 3rd Qu.:22.00 3rd Qu.:65.00 3rd Qu.:101087
## Max. :40.00 Max. :99.00 Max. :136378
## NA's :8503
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## Min. : 1000 2013-11-13 00:00:00: 41 Q4 2013:1241
## 1st Qu.: 3890 2013-10-16 00:00:00: 38 Q1 2014:1011
## Median : 6010 2014-01-22 00:00:00: 38 Q3 2013: 829
## Mean : 8303 2014-01-14 00:00:00: 33 Q2 2013: 658
## 3rd Qu.:12000 2014-02-19 00:00:00: 33 Q3 2012: 498
## Max. :35000 2013-09-24 00:00:00: 30 Q2 2012: 464
## (Other) :9787 (Other):5299
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## C70934206057523078260C7: 4 Min. : 0.0 Min. : 0
## 077435186242217874F4D0B: 3 1st Qu.: 130.3 1st Qu.: 1045
## 5BAA3507676872666AA5774: 3 Median : 216.4 Median : 2623
## 7AA03366669917702D0CF2B: 3 Mean : 271.1 Mean : 4175
## A94433662517143290CC132: 3 3rd Qu.: 372.3 3rd Qu.: 5494
## AE513536468130556EE5F83: 3 Max. :1808.8 Max. :40548
## (Other) :9981
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## Min. : 0.0 Min. : 0.0 Min. :-538.46
## 1st Qu.: 536.1 1st Qu.: 283.2 1st Qu.: -73.19
## Median : 1637.2 Median : 709.1 Median : -34.83
## Mean : 3082.4 Mean : 1092.2 Mean : -55.30
## 3rd Qu.: 4000.0 3rd Qu.: 1467.4 3rd Qu.: -14.19
## Max. :35000.0 Max. :15547.7 Max. : 32.06
##
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## Min. :-6221.32 Min. : 0.0 Min. : -269.2
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0.00 Median : 0.0 Median : 0.0
## Mean : -15.15 Mean : 745.5 Mean : 726.7
## 3rd Qu.: 0.00 3rd Qu.: 0.0 3rd Qu.: 0.0
## Max. : 0.00 Max. :24317.9 Max. :24317.9
##
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## Min. : 0.00 Min. :0.7014 Min. : 0.0000
## 1st Qu.: 0.00 1st Qu.:1.0000 1st Qu.: 0.0000
## Median : 0.00 Median :1.0000 Median : 0.0000
## Mean : 26.95 Mean :0.9985 Mean : 0.0473
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.: 0.0000
## Max. :11857.11 Max. :1.0000 Max. :24.0000
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. :0.000 Min. : 0.00 Min. : 1.00
## 1st Qu.:0.000 1st Qu.: 0.00 1st Qu.: 2.00
## Median :0.000 Median : 0.00 Median : 44.00
## Mean :0.019 Mean : 13.07 Mean : 81.99
## 3rd Qu.:0.000 3rd Qu.: 0.00 3rd Qu.: 118.00
## Max. :5.000 Max. :10000.00 Max. :1189.00
##
We start with some simple data cleaning to make the data easier to examine.
ClosedDate is stored as a factor, so we convert it into a date before splitting it into year, month and day.
First we convert ClosedDate into a standard date format, then we use the separate function to create new year, month and day variables from it.
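A sketch of the conversion and split, assuming `separate` refers to `tidyr::separate`. The toy data frame and the column names `ClosedYear` and `ClosedMonth` are assumptions for illustration (`ClosedDay` is the name used later in the report).

```r
library(tidyr)

# Toy stand-in for the ClosedDate factor; "" marks loans with no closing date.
loan_sample <- data.frame(
  ClosedDate = factor(c("", "2009-08-14 00:00:00", "2013-03-26 00:00:00"))
)

# Factor -> character -> Date. Blank strings become NA, and the trailing
# " 00:00:00" is ignored because the format only covers year-month-day.
closed_chr <- as.character(loan_sample$ClosedDate)
closed_chr[closed_chr == ""] <- NA
loan_sample$ClosedDate <- as.Date(closed_chr, format = "%Y-%m-%d")

# Split the date on "-" into three new integer columns, keeping ClosedDate.
loan_sample <- separate(loan_sample, ClosedDate,
                        into = c("ClosedYear", "ClosedMonth", "ClosedDay"),
                        sep = "-", remove = FALSE, convert = TRUE)
```

Going through `as.character()` first matters: calling a numeric conversion on a factor directly would return its internal level codes rather than the date strings.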
Next we arrange the IncomeRange and LoanOriginationQuarter factors in a more intuitive order.
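One way this reordering could look in base R. The income order follows the order in which the groups are printed later in the report; sorting quarters by year and then by quarter is an assumption about what "intuitive" means here, and the toy data frame is illustrative.

```r
# Toy stand-in with the two factors in their default (alphabetical) order.
loan_sample <- data.frame(
  IncomeRange = factor(c("$100,000+", "$1-24,999", "Not displayed",
                         "$25,000-49,999")),
  LoanOriginationQuarter = factor(c("Q1 2014", "Q4 2005", "Q3 2013", "Q1 2006"))
)

# Income ranges from "unknown" up to the highest bracket.
income_levels <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                   "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                   "$100,000+")
loan_sample$IncomeRange <- factor(loan_sample$IncomeRange,
                                  levels = income_levels)

# Quarter labels look like "Q1 2006": characters 4-7 are the year and
# characters 1-2 the quarter, so sort by year first, then by quarter.
quarters <- levels(loan_sample$LoanOriginationQuarter)
quarter_order <- order(substr(quarters, 4, 7), substr(quarters, 1, 2))
loan_sample$LoanOriginationQuarter <- factor(loan_sample$LoanOriginationQuarter,
                                             levels = quarters[quarter_order])
```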
Univariate
Looking at the distribution of Credit Grade among borrowers, we notice that a large share of borrowers have no credit grade (7,464 of the 10,000 sampled have a blank CreditGrade). Stripping those borrowers away, most of the graded borrowers are in grade C or D.

The following graph makes clear why CreditGrade has a significant number of blanks: after Q4 2008, borrowers are no longer credit graded. The second plot confirms this. There thus seems to be a structural change in the lending model after Q4 2008.

Given this structural change, with applicants after Q4 2008 no longer credit graded by the same measure, we can explore whether the models used to evaluate borrowers before and after Q4 2008 differ significantly, and which is better.
We add a new binary variable to the data set to mark each observation as pre-2008 or post-2008.
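A sketch of such a flag. The source does not say which column or cutoff date defines the split, so using LoanOriginationDate with a cutoff at the end of 2008 is an assumption, as is the column name `Period`. The subset names `loan_sample_pre` and `loan_sample_post` are the ones used later in the report.

```r
# Toy stand-in; LoanOriginationDate is a factor, as in the source data.
loan_sample <- data.frame(
  LoanOriginationDate = factor(c("2007-06-01 00:00:00",
                                 "2008-10-15 00:00:00",
                                 "2009-07-20 00:00:00",
                                 "2013-11-13 00:00:00"))
)

# Assumed cutoff: loans originated up to the end of 2008 count as "pre-2008".
cutoff <- as.Date("2008-12-31")
origination <- as.Date(as.character(loan_sample$LoanOriginationDate),
                       format = "%Y-%m-%d")

# Period is a hypothetical name for the binary flag.
loan_sample$Period <- factor(ifelse(origination <= cutoff,
                                    "pre-2008", "post-2008"),
                             levels = c("pre-2008", "post-2008"))

# Split into the two subsets compared in the rest of the analysis.
loan_sample_pre  <- subset(loan_sample, Period == "pre-2008")
loan_sample_post <- subset(loan_sample, Period == "post-2008")
```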
Credit quality can also be measured by the delinquent amount as a percentage of the original loan amount.
Removing all individuals with no delinquent amount, we note that most delinquencies are less than 25% of the original loan amount. Beyond that point, the number of borrowers drops off significantly.
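The measure can be sketched as below, assuming it is AmountDelinquent divided by LoanOriginalAmount. The column name `PctDelinquent` and the toy values are illustrative, not taken from the data.

```r
# Toy stand-in with the two columns involved.
loan_sample <- data.frame(
  AmountDelinquent   = c(0, 47, 179, NA, 5000, 0),
  LoanOriginalAmount = c(9000, 13000, 2000, 3500, 10000, 25000)
)

# Delinquent amount as a fraction of the original loan amount.
loan_sample$PctDelinquent <-
  loan_sample$AmountDelinquent / loan_sample$LoanOriginalAmount

# Drop borrowers with no (or unknown) delinquent amount.
delinquent <- subset(loan_sample, !is.na(PctDelinquent) & PctDelinquent > 0)

# Share of the remaining borrowers below 25% of the original amount.
share_below_25 <- mean(delinquent$PctDelinquent < 0.25)
share_below_25
```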

A simple function to create bar plots with a given variable and data set.
library(ggplot2)

# Bar plot of one variable from a data set.
# standardplot = TRUE: plain bar chart with rotated x labels (for factors).
# standardplot = FALSE: numeric x axis with breaks every `xbreaks`,
# zoomed to the y range [ylower, yupper].
create_bar <- function(dataset,
                       variable1,
                       xbreaks = 1,
                       ylower = 1,
                       yupper = 1,
                       standardplot = TRUE) {
  base_plot <- ggplot(dataset, aes(x = .data[[variable1]])) +
    geom_bar() +
    labs(x = variable1)
  if (standardplot) {
    return(base_plot +
             theme(axis.text.x = element_text(angle = 90)))
  }
  # The largest observed value sets the upper end of the x-axis breaks.
  maxx <- round(max(dataset[[variable1]], na.rm = TRUE), digits = 2)
  base_plot +
    scale_x_continuous(breaks = seq(0, maxx, xbreaks)) +
    coord_cartesian(ylim = c(ylower, yupper))
}
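The year 2013 seems to be the peak year of borrowing, with a sharp fall in 2014. Since the most recent dates in the summary above fall in early 2014, the 2014 drop may simply reflect a partial year.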

We see more loan closures in February, with a second spike in September. Closures tend to increase towards the end of each quarter.

The 22nd seems to be the most popular day of the month for loan closures.

Most borrowers are in the income range $25,000-49,999 and $50,000-74,999.

The majority of incomes are verifiable (9,213 of the 10,000 sampled).

The majority of loans have no more than 300 investors.

BorrowerAPR is generally concentrated between 10% and 30%, with twin peaks around 15% and 23%.

Similarly, LenderYield is generally concentrated between 5% and 30%, with peaks around 15%, 23% and 30%.

California has the highest number of loan applicants (1,284 in the sample), followed by Texas (644) and Florida (619).

About half of the sampled borrowers are homeowners (5,084 of 10,000).

On average, borrowers have about 9 open credit lines (mean 9.3, median 9 in the summary above).

Bank card utilisation is generally high (median 0.61 in the summary above), and the number of borrowers increases with utilisation.

Borrower counts are highest in Q4 2013, followed by Q1 2014 and Q3 2013.

Looking at the distribution of investors pre-2008 and post-2008, we notice that the two look rather similar.
For ease of comparison, both samples are plotted on the same axes, showing counts only up to 100 and a maximum of 800 investors.

Main Takeaways
- There is a structural change in credit lending standards at the end of 2008.
- Most delinquencies are less than 25% of the original loan amount.
- Pre-2008, most borrowers with a Credit Grade are in C or D.
- Most borrowers are in the income range $25,000-74,999.
- The distribution of investor counts looks broadly the same pre-2008 and post-2008.
- The average borrower has about 9 open credit lines.
Bivariate
To get a general overview of the data and spot interesting relationships to explore, we use ggpairs to plot the pairwise relationships between variables.
As expected, we see a strong correlation between the rate charged to the borrower and the yield earned by the lender (in the summary above, LenderYield sits one percentage point below BorrowerRate at every quartile), and the picture is much the same across the different states.

This is further confirmed by the plot of the correlation between LenderYield and BorrowerAPR within each state.
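library(dplyr)
library(ggplot2)

# Correlation between LenderYield and BorrowerAPR within each state.
# mutate() repeats the state's value on every row; cor() returns NA for a
# state with any missing value, and geom_point() drops those rows.
A1 <- loan_sample %>%
  group_by(BorrowerState) %>%
  mutate(COR = cor(LenderYield, BorrowerAPR))

ggplot(A1, aes(x = BorrowerState, y = COR)) +
  geom_point(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))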
## Warning: Removed 497 rows containing missing values (geom_point).

Plotting the number of loans closed on each ClosedDay, broken down by month, shows something very interesting. Closures seem to spike at the start of the week and taper off towards the end of the week. There also tend to be spikes towards the end of the month.

Something interesting occurred when we plotted the bar graph of income range faceted on Credit Grade (excluding the post-2008 samples). The general distribution of income looks very similar across CreditGrade levels, with the peak at the $25,000-49,999 income range.
Furthermore,
- Income range seems to have minimal impact on an individual's CreditGrade, since the income distribution shows no skew across the different grades.
- There are observably fewer borrowers in the “$1-24,999” category than in the “$50,000-74,999” category, which is slightly counterintuitive given the trend from $25,000 to $100,000+.

However, putting it beside the post-2008 plot shows obvious differences in the income profile of the borrowers.
- Data collection on income seems to have improved, with the “Not displayed” category disappearing altogether.
- More importantly, there is a slight change in the income profile of the borrowers, with higher-income applicants using the credit facility more post-2008.

To confirm that the income we are using is reliable, we check that the majority of incomes are verified.

We remove the subset of incomes that cannot be verified.
# Keep only the borrowers whose income is verifiable.
loan_sample_pre1 <- subset(loan_sample_pre, IncomeVerifiable == "True")
loan_sample_post1 <- subset(loan_sample_post, IncomeVerifiable == "True")
The percentage of delinquent amount varies with income range, but the next plot shows that the histograms have rather similar shapes.

We assume that BorrowerAPR and the number of investors are determined independently. BorrowerAPR is set strictly by the lending company, while investors take their own look at each listing and decide where best to invest their funds, independent of the BorrowerAPR the lending company charges (i.e. even if the lending company charges a very low APR, a borrower whom investors consider risky will attract very few investors).
From the scatter plot below, we note a general trend in which riskier borrowers (as measured by their BorrowerAPR) get fewer investors.
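A trend like this could also be summarised with a rank correlation, sketched here on made-up values. The choice of Spearman's method is an assumption (it only asks whether investors fall as APR rises, not whether the relationship is linear), and the numbers are not from the data.

```r
# Made-up values for illustration only.
loan_sample <- data.frame(
  BorrowerAPR = c(0.08, 0.12, 0.18, 0.25, 0.31, NA),
  Investors   = c(320, 210, 150, 60, 20, 45)
)

# Spearman rank correlation, ignoring rows with a missing value.
# A value near -1 means investor counts fall steadily as APR rises.
apr_investor_cor <- cor(loan_sample$BorrowerAPR, loan_sample$Investors,
                        use = "complete.obs", method = "spearman")
apr_investor_cor
```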

The $25,000 to $74,999 income groups are the main reason the average borrower has about 9 open credit lines; higher-income individuals have slightly more open credit lines and lower-income individuals have fewer.
There seems to be a relationship between the number of open credit lines and income range.

Main Takeaways:
- Most incomes are verified.
- Data collection is better post-2008, with far fewer “Not displayed” incomes.
- Borrowers have a slightly different income profile pre-2008 and post-2008.
- Pre-2008, most borrowers are in the “$25,000-49,999” or “Not displayed” income range.
- Post-2008, most borrowers are in the “$50,000-74,999” income range.
- Riskier borrowers (as measured by their BorrowerAPR) get fewer investors.
- The histograms of percentage of delinquent amount have a similar shape across income groups.
- Income seems to have minimal impact on the CreditGrade of an individual.
- Loan closures (ClosedDay) seem to spike at the start of the week and taper off towards the end of the week.
- Closures also tend to spike towards the end of the month.
- There is a strong correlation between BorrowerAPR and LenderYield.
- The picture is the same across all the different states.
- The higher an individual's income, the more open credit lines they are likely to have.
Multivariate Plots
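Individuals generally have more open credit lines post-2008, except in the top income group and the “Not employed” group, as the means in the summaries below show (pre-2008 first, then post-2008).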
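Per-group summaries in this format are what base R's `by()` prints. A call along the following lines would produce them, assuming the summarised column is OpenCreditLines; the toy data frame is illustrative.

```r
# Toy stand-in for the pre-2008 subset.
loan_sample_pre <- data.frame(
  IncomeRange = factor(c("$1-24,999", "$1-24,999",
                         "$100,000+", "$100,000+", "$100,000+"),
                       levels = c("$1-24,999", "$100,000+")),
  OpenCreditLines = c(3, 7, 8, 12, NA)
)

# Apply summary() to OpenCreditLines within each income range.
open_lines_by_income <- by(loan_sample_pre$OpenCreditLines,
                           loan_sample_pre$IncomeRange,
                           summary)
open_lines_by_income
```

Groups containing missing values get an extra `NA's` column in their summary, as in the “Not displayed” group of the real output.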
## loan_sample_pre$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.750 6.500 6.917 10.000 14.000 673
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 6.000 8.667 8.500 35.000
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.500 7.446 10.000 26.000
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 5.000 5.368 7.000 14.000
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 7.000 7.124 9.000 43.000
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 8.000 9.007 12.000 29.000
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 9.000 9.924 12.000 32.000
## --------------------------------------------------------
## loan_sample_pre$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.00 12.00 12.76 17.00 51.00
## loan_sample_post$IncomeRange: Not displayed
## NULL
## --------------------------------------------------------
## loan_sample_post$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 4.000 6.000 7.915 11.000 25.000
## --------------------------------------------------------
## loan_sample_post$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 10.000 8.222 11.000 16.000
## --------------------------------------------------------
## loan_sample_post$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.250 6.000 6.249 8.000 21.000
## --------------------------------------------------------
## loan_sample_post$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 8.000 8.214 11.000 34.000
## --------------------------------------------------------
## loan_sample_post$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 9.000 9.481 12.000 34.000
## --------------------------------------------------------
## loan_sample_post$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.00 9.00 10.06 13.00 36.00
## --------------------------------------------------------
## loan_sample_post$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 11.00 11.49 15.00 38.00

Another way to look at the same data is through boxplots of displaced data, which make the outliers easier to see.

To understand why higher-income individuals apply for the loan facility much more frequently, we look at the following 2 boxplots. The amount of credit in the accounts is actually lower for the top 2 income groups post-2008; by contrast, it seems to be higher in the next 2 income groups ($25,000-49,999 and $50,000-74,999).

Though there are more open credit lines and applications for higher-income individuals, the amount of credit available to them seems to have decreased.
We have established that the average amount of credit available to the top 2 income ranges has decreased, whilst that of the next 2 income ranges has increased. Utilisation does not seem to have changed much for the top income brackets but has decreased for the next 3 income groups.
A charitable observer might claim that borrowers in the lower income ranges have become more “disciplined” in their use of the loan facility post-2008, given the reduction in bank card utilisation, supplemented by the fact that lenders have increased the amount of credit available to them.

To see whether the statement above is true, we measure how disciplined borrowers are with their finances by their debt-to-income ratio. Interestingly, that does not seem to be the case: the mean DebtToIncomeRatio has risen post-2008 in every income group with data in both periods.
In fact, it seems that individuals in the lower income ranges have become “less disciplined”.
## loan_sample_pre1$IncomeRange: Not displayed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0900 0.1600 0.2092 0.2500 5.5600 4
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.09 0.14 0.19 0.19 0.24 0.29 5
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 4
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.1500 0.2700 0.4893 0.4300 10.0100
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1450 0.2400 0.2599 0.3400 0.9200
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1300 0.2000 0.2235 0.3000 0.8500
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1100 0.1650 0.1854 0.2425 0.5900
## --------------------------------------------------------
## loan_sample_pre1$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.0900 0.1500 0.1686 0.2400 0.4200
## loan_sample_post1$IncomeRange: Not displayed
## NULL
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: Not employed
## NULL
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: $0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: $1-24,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.2000 0.3400 0.5109 0.5300 10.0100
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: $25,000-49,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1800 0.2700 0.2873 0.3700 1.5100
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: $50,000-74,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.020 0.160 0.230 0.238 0.310 1.210
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: $75,000-99,999
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1300 0.2000 0.2076 0.2700 0.7800
## --------------------------------------------------------
## loan_sample_post1$IncomeRange: $100,000+
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1200 0.1700 0.1768 0.2300 0.5300
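The group means printed above can be compared directly. The snippet below copies the Mean column for the five income ranges that have data in both periods and takes the difference; the variable names are illustrative.

```r
income_groups <- c("$1-24,999", "$25,000-49,999", "$50,000-74,999",
                   "$75,000-99,999", "$100,000+")

# Mean DebtToIncomeRatio per income range, copied from the output above.
dti_mean_pre  <- c(0.4893, 0.2599, 0.2235, 0.1854, 0.1686)
dti_mean_post <- c(0.5109, 0.2873, 0.2380, 0.2076, 0.1768)

# Positive values mean the ratio rose post-2008.
dti_change <- setNames(dti_mean_post - dti_mean_pre, income_groups)
round(dti_change, 4)
```

The mean rises in all five groups, by between roughly 0.008 and 0.027, with the largest increase in the $25,000-49,999 range and the smallest in the $100,000+ range.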

The following scatter plot shows that homeowners tend to have a higher revolving credit balance on average and also tend to have higher incomes.
This is confirmed by the boxplot next to it.

A question that came to mind is whether homeowners tend to have more days delinquent, given that they tend to have higher revolving credit balances. From the plot below, homeowners do seem more “responsible” and better able to service their debt, with a lower average number of days delinquent, although the difference is surprisingly small (about a month).

The plot below shows debt-to-income ratio against percentage of delinquent amount for the different income ranges.
One thing that stands out is that borrowers in the $1-24,999 range seem to be the most vulnerable group, setting aside the other 3 groups for which we have minimal data. Their debt-to-income ratio seems to be highly correlated with their percentage of delinquent amount. This issue is much less pronounced in the higher income groups.
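The visual impression could be checked with a per-group correlation, sketched here on made-up values. `PctDelinquent` is a hypothetical column holding the delinquent amount as a fraction of the original loan amount, and the numbers only illustrate the mechanics.

```r
# Made-up values for two income ranges.
loan_sample <- data.frame(
  IncomeRange       = rep(c("$1-24,999", "$100,000+"), each = 4),
  DebtToIncomeRatio = c(0.2, 0.4, 0.6, 0.8, 0.1, 0.2, 0.3, 0.4),
  PctDelinquent     = c(0.10, 0.20, 0.30, 0.40, 0.05, 0.05, 0.06, 0.04)
)

# Pearson correlation between the two measures within each income range.
cor_by_income <- sapply(
  split(loan_sample, loan_sample$IncomeRange),
  function(g) cor(g$DebtToIncomeRatio, g$PctDelinquent, use = "complete.obs")
)
cor_by_income
```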
